FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models
Modoranu, Ionut-Vlad, Safaryan, Mher, Schultheis, Erik, Ryabinin, Max, Chumachenko, Artem, Alistarh, Dan
Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, the DCT can be computed via Makhoul's $N$-point algorithm based on the Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes. Our code is available at \href{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}{\texttt{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}}.
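The two-step procedure described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the paper's implementation: `dct_matrix` and `select_dct_projection` are hypothetical names, and the FFT-based Makhoul variant and the surrounding optimizer are omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; rows are the predefined basis vectors."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    D = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0] *= 1.0 / np.sqrt(n)
    D[1:] *= np.sqrt(2.0 / n)
    return D

def select_dct_projection(grad, rank):
    """Step 1: one matmul with the DCT matrix; step 2: lightweight sort."""
    D = dct_matrix(grad.shape[0])
    coeffs = D @ grad                          # alignment of each basis vector with the gradient
    scores = np.linalg.norm(coeffs, axis=1)
    top = np.sort(np.argsort(scores)[-rank:])  # keep the `rank` best-aligned basis vectors
    P = D[top]                                 # effective (rank, n) projection matrix
    return P, P @ grad                         # projection and low-rank gradient
```

Since the DCT basis is fixed, `dct_matrix` would be computed once at the start of training; only the cheap selection step changes per layer.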
SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
Tănase, Andrei-Valentin, Pelican, Elena
Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with tokenization strategies remaining largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning "superword" tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves a 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI's o200k tokenizer and a 30% improvement over Google's Gemma 3 tokenizer (256k vocabulary), while maintaining competitive performance across 38 languages. When integrated with a GPT-2 scale model (124M parameters) trained on 10 billion tokens from the FineWeb-Edu dataset, SupraTok yields an 8.4% improvement on HellaSWAG and 9.5% on MMLU benchmarks without architectural modifications. While these results are promising at this scale, further validation at larger model scales is needed. These findings suggest that efficient tokenization can complement architectural innovations as a path to improved language model performance.
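As a toy illustration of cross-boundary merging, a byte-pair-style learner that treats spaces as ordinary symbols can fuse frequent word pairs into single "superword" tokens. This is a minimal sketch under that one assumption, not the actual SupraTok pipeline, which additionally uses entropy-driven curation and curriculum learning.

```python
from collections import Counter

def train_superword_bpe(corpus, num_merges):
    """Toy BPE whose merges may cross word boundaries, yielding multi-word tokens."""
    # Start from characters; spaces are ordinary symbols, so merges can span words.
    seqs = [list(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # greedily merge the most frequent pair
        merges.append((a, b))
        for seq in seqs:
            i, out = 0, []
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq[:] = out
    return merges, seqs
```

On a corpus where a word pair recurs often, the learned vocabulary eventually contains that pair as one token containing a space, which is the compression effect the abstract describes.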
Supernova: Achieving More with Less in Transformer Architectures
Tanase, Andrei-Valentin, Pelican, Elena
The transformer architecture [1] has fundamentally transformed natural language processing, establishing itself as the dominant paradigm for language modeling and understanding tasks. However, the field's trajectory toward ever-larger models has created significant computational and economic challenges. Contemporary models such as OpenAI's GPT series, Anthropic's Claude, and Google's Gemini have pushed parameter counts into the hundreds of billions, resulting in unprecedented infrastructure costs that increasingly exceed the economic value these models generate in many practical applications. This scaling trajectory has reached a critical inflection point where the marginal benefits of additional parameters diminish rapidly while computational requirements grow exponentially. Despite this economic reality, there has been surprisingly limited systematic exploration of compact, efficient transformer architectures that could deliver comparable performance at sustainable computational costs. The prevailing assumption that model quality scales monotonically with parameter count has created a significant research gap in the sub-billion parameter regime, leaving unexplored the potential for architectural innovation to compensate for reduced scale. In this work, we challenge this scaling paradigm by presenting Supernova, a 650M parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve performance comparable to significantly larger models while maintaining computational efficiency. Our approach is grounded in three fundamental principles: architectural efficiency through modern component integration, superior tokenization design, and dramatic improvements in data efficiency.
SQuat: Subspace-orthogonal KV Cache Quantization
Wang, Hao, Han, Ligong, Xu, Kai, Srivastava, Akash
The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism's outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by a factor of 2.17 to 2.82, improves throughput by a factor of 2.45 to 3.60, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
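The orthogonality constraint at the heart of the method can be sketched in a few lines. This is an illustrative correction step only, with a hypothetical function name, and not the paper's full quantization scheme: given an orthonormal basis of the query subspace, the in-subspace component of the quantization error is added back so the remaining error is orthogonal to that subspace.

```python
import numpy as np

def squat_style_dequant(key, key_quantized, U):
    """Correct a dequantized key so its error is orthogonal to the query subspace.

    U: (d, r) matrix with orthonormal columns spanning the query subspace.
    Returns key_hat satisfying U.T @ (key_hat - key) == 0.
    """
    err = key - key_quantized                   # quantization error
    key_hat = key_quantized + U @ (U.T @ err)   # restore the in-subspace error component
    return key_hat
```

Since attention scores are inner products with queries, an error orthogonal to the query subspace leaves those inner products unchanged, which is the intuition behind the constraint.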
Randomized Kaczmarz Methods with Beyond-Krylov Convergence
Dereziński, Michał, Needell, Deanna, Rebrova, Elizaveta, Yang, Jiaming
Randomized Kaczmarz methods form a family of linear system solvers which converge by repeatedly projecting their iterates onto randomly sampled equations. While effective in some contexts, such as highly over-determined least squares, Kaczmarz methods are traditionally deemed secondary to Krylov subspace methods, since this latter family of solvers can exploit outliers in the input's singular value distribution to attain fast convergence on ill-conditioned systems. In this paper, we introduce Kaczmarz++, an accelerated randomized block Kaczmarz algorithm that exploits outlying singular values in the input to attain a fast Krylov-style convergence. Moreover, we show that Kaczmarz++ captures large outlying singular values provably faster than popular Krylov methods, for both over- and under-determined systems. We also develop an optimized variant for positive semidefinite systems, called CD++, demonstrating empirically that it is competitive in arithmetic operations with both CG and GMRES on a collection of benchmark problems. To attain these results, we introduce several novel algorithmic improvements to the Kaczmarz framework, including adaptive momentum acceleration, Tikhonov-regularized projections, and a memoization scheme for reusing information from previously sampled equation blocks.
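For readers unfamiliar with the baseline, the basic randomized block Kaczmarz iteration (without Kaczmarz++'s momentum acceleration, regularized projections, or memoization) can be sketched as follows; the function name and parameters are illustrative.

```python
import numpy as np

def block_kaczmarz(A, b, block_size=4, iters=200, seed=0):
    """Basic randomized block Kaczmarz: project the iterate onto a random block of equations."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        idx = rng.choice(m, size=block_size, replace=False)
        Ab, bb = A[idx], b[idx]
        # Minimal-norm correction that makes the sampled equations hold exactly.
        x = x + np.linalg.pinv(Ab) @ (bb - Ab @ x)
    return x
```

Each step is an orthogonal projection onto the solution set of the sampled block, which is the mechanism the abstract's first sentence describes.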
Towards nation-wide analytical healthcare infrastructures: A privacy-preserving augmented knee rehabilitation case study
Bačić, Boris, Vasile, Claudiu, Feng, Chengwei, Ciucă, Marian G.
The purpose of this paper is to contribute towards near-future privacy-preserving big data analytical healthcare platforms capable of processing streamed or uploaded timeseries data or videos from patients. The experimental work includes a real-life knee rehabilitation video dataset capturing a set of exercises ranging from simple and personalised to more general and challenging movements aimed at returning to sport. To convert mobile video into privacy-preserving diagnostic timeseries data, we employed Google MediaPipe pose estimation. The developed proof-of-concept algorithms can augment knee exercise videos by overlaying stick figure elements on the patient while updating a generated timeseries plot of knee-angle estimates streamed in CSV format. By setting a-priori knee-angle parameters, patients and physiotherapists can obtain video with a side-by-side timeseries that visually indicates potential issues such as excessive knee flexion, unstable knee movements, or stick figure overlay errors. To address adherence to the rehabilitation programme and quantify exercise sets and repetitions, our adaptive algorithm correctly identifies 91.67%-100% of all exercises from side- and front-view videos. The transparent algorithm design for adaptive visual analysis of various knee exercise patterns contributes towards interpretable AI and will inform near-future privacy-preserving, non-vendor-locking, open-source developments both for end-user computing devices and as on-premises, non-proprietary cloud platforms that can be deployed within national healthcare systems.
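A minimal sketch of the knee-angle estimation step, assuming the hip, knee, and ankle landmark coordinates have already been extracted from pose-estimation output (e.g. MediaPipe); the function name is illustrative, not the paper's code.

```python
import numpy as np

def knee_flexion_angle(hip, knee, ankle):
    """Knee joint angle in degrees from 2D or 3D landmark coordinates.

    180 degrees corresponds to a fully straight leg; smaller values indicate flexion.
    """
    v1 = np.asarray(hip, dtype=float) - np.asarray(knee, dtype=float)
    v2 = np.asarray(ankle, dtype=float) - np.asarray(knee, dtype=float)
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))
```

Computed per frame, these angles form exactly the kind of timeseries the paper streams to a CSV file and checks against a-priori thresholds (e.g. flagging excessive flexion).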
LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
Robert, Thomas, Safaryan, Mher, Modoranu, Ionut-Vlad, Alistarh, Dan
We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimating the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
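The interplay of low-rank projection and error feedback can be sketched as a single illustrative step. This is not LDAdam itself: the name `ld_step` is hypothetical, and bias correction and the projection-aware state transition between subspaces are omitted.

```python
import numpy as np

def ld_step(grad, state, P, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One adaptive step in an r-dim subspace with a full-dim error-feedback buffer.

    P: (r, n) row-orthonormal projection; state holds low-dim moments m, v
    and a full-dim error buffer e for the signal the projection drops.
    """
    g = grad + state["e"]                 # error feedback: re-inject compressed-away signal
    low = P @ g                           # project to the r-dim subspace
    state["e"] = g - P.T @ low            # residual not representable in the subspace
    state["m"] = b1 * state["m"] + (1 - b1) * low
    state["v"] = b2 * state["v"] + (1 - b2) * low ** 2
    update_low = state["m"] / (np.sqrt(state["v"]) + eps)
    return -lr * (P.T @ update_low)       # map the update back to parameter space
```

Note that the moment buffers `m` and `v` are r-dimensional, which is where the memory saving comes from; only the error buffer is full-dimensional.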
Graph Transformers Dream of Electric Flow
Cheng, Xiang, Carin, Lawrence, Sra, Suvrit
We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition. The input to the Transformer is simply the graph incidence matrix; no other explicit positional encoding information is provided. We present explicit weight configurations for implementing each such graph algorithm, and we bound the errors of the constructed Transformers by the errors of the underlying algorithms. Our theoretical findings are corroborated by experiments on synthetic data. Additionally, on a real-world molecular regression task, we observe that the linear Transformer is capable of learning a more effective positional encoding than the default one based on Laplacian eigenvectors. Our work is an initial step towards elucidating the inner workings of the Transformer for graph data.
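For concreteness, the electric-flow target computation can be written down directly from the incidence matrix. This sketch shows the quantity the paper's constructions emulate (unit resistances assumed), not the Transformer weight configurations themselves.

```python
import numpy as np

def electric_flow(edges, n, s, t):
    """Unit-resistance electric s-t flow computed from the graph incidence matrix."""
    B = np.zeros((len(edges), n))        # edge-node incidence matrix
    for e, (u, v) in enumerate(edges):
        B[e, u], B[e, v] = 1.0, -1.0
    L = B.T @ B                          # graph Laplacian
    demand = np.zeros(n)
    demand[s], demand[t] = 1.0, -1.0     # inject one unit at s, extract at t
    potentials = np.linalg.pinv(L) @ demand
    return B @ potentials                # flow on each edge is the potential difference
```

On a triangle with a unit flow from node 0 to node 2, the direct edge carries 2/3 of the flow and the two-hop path carries 1/3, matching the resistances of the parallel paths.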
Reddit is all you need: Authorship profiling for Romanian
Ştefănescu, Ecaterina, Jerpelea, Alexandru-Iulius
Authorship profiling is the process of identifying an author's characteristics based on their writings. This centuries-old problem has become even more intriguing with recent developments in Natural Language Processing (NLP). In this paper, we introduce a corpus of short texts in the Romanian language annotated with author-characteristic keywords; to our knowledge, the first of its kind. To build it, we exploit the social media platform Reddit, leveraging its thematic community-based structure (its subreddit structure), which offers information about an author's background. We infer a user's demographics and some broad personal traits, such as age category, employment status, interests, and social orientation, based on the subreddit and other cues. We thus obtain a corpus of 23k+ samples extracted from 100+ Romanian subreddits. We analyse our dataset and, finally, fine-tune and evaluate Large Language Models (LLMs) to establish baseline capabilities for authorship profiling on the corpus, indicating the need for further research in the field. We publicly release all our resources.
Relational Graph Convolutional Networks Do Not Learn Sound Rules
Morris, Matthew, Cucala, David J. Tena, Grau, Bernardo Cuenca, Horrocks, Ian
Graph neural networks (GNNs) are frequently used to predict missing facts in knowledge graphs (KGs). Motivated by the lack of explainability for the outputs of these models, recent work has aimed to explain their predictions using Datalog, a widely used logic-based formalism. However, such work has been restricted to certain subclasses of GNNs. In this paper, we consider one of the most popular GNN architectures for KGs, R-GCN, and we provide two methods to extract rules that explain its predictions and are sound, in the sense that each fact derived by the rules is also predicted by the GNN, for any input dataset. Furthermore, we provide a method that can verify that certain classes of Datalog rules are not sound for the R-GCN. In our experiments, we train R-GCNs on KG completion benchmarks, and we are able to verify that no Datalog rule is sound for these models, even though the models often obtain high to near-perfect accuracy. This raises some concerns about the ability of R-GCN models to generalise and about the explainability of their predictions. We further provide two variations to the training paradigm of R-GCN that encourage it to learn sound rules and find a trade-off between model accuracy and the number of learned sound rules.